Lecture ??

Model Comparison and Model Evaluation

Sara Martino

Dept. of Mathematical Sciences, NTNU

Janine Illian

University of Glasgow

Motivation

“All models are wrong, some models are useful” — George Box

  1. How do we check that our model fits our data? - Model Validation
  2. How do we choose the best model? - Model Comparison

But… it is not always easy to distinguish between the two!

Bayesian model comparison

Bayesian Model Comparison

  • There is no gold standard
  • It really depends on what you want to do!
  • Basically two types:
    • Ones that look at the posterior probability of the data under the model
    • Ones that look at how well the model fits the data
  • In general it is not an easy task
  • inlabru provides some options

Model Comparison/Validation in inlabru

  • Criteria of fit

    • Marginal likelihood ⇒ Bayes factors
    • Deviance information criterion (DIC)
    • Widely applicable information criterion (WAIC)
  • There are also some predictive checks for the model:

    • Conditional predictive ordinate (CPO)
    • Probability integral transform (PIT)
  • You can sample from the posterior using generate() and compute

    • log-scores
    • CRPS

Criteria of fit

Marginal likelihood

# tell inlabru you want to compute mlik
bru_options_set(control.compute = list(mlik = TRUE))
fit = bru(cmp, lik)
# see the results
fit$mlik
                                           [,1]
log marginal-likelihood (integration) -153.0651
log marginal-likelihood (Gaussian)    -153.4508

  • Calculates \(\log(\pi(\mathbf{y}))\)
  • Can calculate Bayes factors through differences in value
  • NB: Problematic for intrinsic models
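As a sketch of the Bayes-factor computation (the fit objects `fit1` and `fit2` are hypothetical, assumed to be two competing `bru()` fits run with `mlik = TRUE`):

```r
# Hypothetical fits of two competing models, both computed with
# bru_options_set(control.compute = list(mlik = TRUE))
log_mlik1 <- fit1$mlik[1, 1]  # first row: integration-based estimate
log_mlik2 <- fit2$mlik[1, 1]

# Bayes factor of model 1 over model 2; work on the log scale for stability
log_BF12 <- log_mlik1 - log_mlik2
BF12 <- exp(log_BF12)
```

Differencing on the log scale before exponentiating avoids numerical under/overflow when the marginal likelihoods are very small.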

Deviance Information Criteria (DIC)

# tell inlabru you want to compute DIC
bru_options_set(control.compute = list(dic = TRUE))
fit = bru(cmp, lik)
# see the results
fit$dic$dic
[1] 267.94

  • Measure of complexity and fit.
  • Defined as: \(\text{DIC} = \bar{D} + p_D\)
    • \(\bar{D}\) is the posterior mean of the deviance
    • \(p_D\) is the effective number of parameters.
  • Smaller values indicate better trade-off between complexity and fit.

Widely applicable information criterion (WAIC)

also known as Watanabe–Akaike information criterion.

# tell inlabru you want to compute WAIC
bru_options_set(control.compute = list(waic = TRUE))
fit = bru(cmp, lik)
# see the results
fit$waic$waic
[1] 267.9584

  • similar to the DIC… but arguably better
  • Linked to leave-one-out cross-validation
  • Smaller values indicate better trade-off between complexity and fit

Posterior predictive checks

Posterior predictive distribution

Remember: Our GLM is defined as: \[ \begin{eqnarray} \pi(\mathbf{y}|\mathbf{u},\theta) & =\prod_i \pi(y_i|\mathbf{u},\theta)& \ \text{ likelihood}\\ \pi(\mathbf{u}|\theta)& &\ \text{ LGM}\\ \pi(\theta )& &\ \text{ hyperprior}\\ \end{eqnarray} \] using inlabru we estimate the posterior distribution \(\pi(\mathbf{u},\theta|\mathbf{y})\).

The posterior predictive distribution for a new data point \(\hat{y}_i\) is then: \[ \pi(\hat{y}_i|\mathbf{y}) = \int\pi(\hat{y}_i|\mathbf{u},\theta)\,\pi(\mathbf{u},\theta|\mathbf{y})\ d\mathbf{u}\ d\theta \]

This distribution can be used to check the model fit!

Posterior predictive distribution

\[ \pi(\hat{y}_i|\mathbf{y}) = \int\pi(\hat{y}_i|\mathbf{u},\theta)\,\pi(\mathbf{u},\theta|\mathbf{y})\ d\mathbf{u}\ d\theta \]

NOTE: In general this is NOT computed by inlabru but needs to be approximated

  • Use generate to sample from the posterior \((\mathbf{u}^*_i,\theta^*_i)\sim\pi(\mathbf{u},\theta|\mathbf{y})\)

  • Simulate a new data point \(y^*_i\sim\pi(y|\mathbf{u}^*_i,\theta^*_i)\)

  • Use \(y^*_1,\dots, y^*_N\) to approximate the posterior predictive distribution.
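The three steps above can be sketched as follows for a Poisson model; the fit object `fit`, the data frame `df`, and the component names `Intercept` and `cov` are assumptions for illustration:

```r
# 1. sample the rate from the posterior via the linear predictor
lambda_samples <- generate(
  fit, df,
  formula = ~ exp(Intercept + cov),  # assumes components Intercept + cov
  n.samples = 1000
)
# 2. simulate one new data point per posterior sample
y_star <- apply(lambda_samples, 2, function(lam) rpois(length(lam), lam))
# 3. the empirical distribution of row i of y_star approximates
#    the posterior predictive distribution of y_i
```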

BUT inlabru automatically computes two quantities that are useful for model checking!

  • Conditional predictive ordinate (CPO)
  • Probability integral transform (PIT)

Conditional predictive ordinate (CPO)

Definition: \[ cpo_i = \pi(y^{obs}_i|\mathbf{y}_{-i}) \]

  • Introduced in Pettit (1990)
  • Measures fit through the predictive density
  • Can be used to compute the log-score as \[ \text{Score} = -\sum_i \log(cpo_i) \] lower scores correspond to better models
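Assuming a fit run with `control.compute = list(cpo = TRUE)` (as shown on the next slide), the log-score is a one-liner:

```r
# log-score from the CPO values computed by inlabru; smaller is better
score <- -sum(log(fit$cpo$cpo))
```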

Probability integral transform (PIT)

How to compute:

# tell inlabru you want to compute CPO
bru_options_set(control.compute = list(cpo = TRUE))
fit = bru(cmp, lik)
# see the results
head(fit$cpo$cpo)
[1] 0.2911299 0.1055250 0.4191026 0.3546576 0.1097256 0.4343125

Note: it is possible to check for failures in the computed CPOs

head(fit$cpo$failure)
[1] 0 0 0 0 0 0

Probability integral transform (PIT)

Definition: \[ pit_i = \text{Prob}(\hat{y}_i\leq y^{obs}_i|\mathbf{y}_{-i}) \]

  • Linked to leave-one-out cross-validation

  • \(pit_i\) shows how well the ith data point is predicted by the rest of the data

  • Very small values indicate a “surprising” observation under the model

  • For a well-calibrated model, the PIT values should be approximately uniformly distributed.
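A common informal check, assuming a fit run with `control.compute = list(cpo = TRUE)`, is to inspect the PIT values for uniformity:

```r
# histogram should look roughly flat for a well-calibrated model
hist(fit$cpo$pit, breaks = 20, main = "PIT values", xlab = "PIT")
# a rough numeric companion: Kolmogorov-Smirnov test against Uniform(0,1)
ks.test(fit$cpo$pit, "punif")
```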

Probability integral transform (PIT)

How to compute:

# PIT is computed together with CPO
bru_options_set(control.compute = list(cpo = TRUE))
fit = bru(cmp, lik)
# see the results
head(fit$cpo$pit)
[1] 0.1827116 0.9558604 0.5862908 0.7354538 0.9533412 0.4384053

Good and Bad PIT plots

Other scores

In the literature there are many proposed scores for evaluating predictions. For example:

  • Dawid–Sebastiani score
  • Log-score
  • Continuous ranked probability score (CRPS)
  • Brier score

They all have their strengths and weaknesses, and which one is better depends on the goals of the model.

inlabru does not provide such scores automatically, but they can be computed using simulations from the posterior distribution.

Example: CRPS for Poisson data

Our model: \[ \begin{eqnarray} y_i|\lambda_i & \sim \text{Poisson}(\lambda_i),&\ i = 1,\dots,N_{\text{data}}\\ \log(\lambda_i) = \eta_i &= \beta_0 + \beta_1 x_i \end{eqnarray} \]

Simulate data and fit the model:

x <- rnorm(50)
df_pois <- data.frame(
  x = x,
  y = rpois(length(x), exp(2 + 1 * x))
)
cmp = ~ Intercept(1) + cov(x, model = "linear")
lik = bru_obs(formula = y~.,
              family = "poisson",
              data = df_pois)
fit_pois <- bru(cmp, lik)

Example: CRPS for Poisson data

The CRPS score is defined as: \[ \text{S}_{\text{CRPS}}(F_i, y_i) = \sum_{k=0}^\infty\left[\text{Prob}(Y_i\leq k|\mathbf{y})-I(y_i\leq k)\right]^2 \]

Computational algorithm:

  1. Simulate \(\lambda^{(j)}\sim \pi(\lambda|\text{data}),\ j = 1,\dots, N_{\text{samples}}\) using generate() (a matrix of size \(N_{\text{data}}\times N_\text{samples}\)).

  2. For each \(i=1,\dots,N_{\text{data}}\), estimate \(r_{ik}=\text{Prob}(Y\leq k|\text{data})-I(y_i\leq k)\) as \[ \hat{r}_{ik} = \frac{1}{N_\text{samples}} \sum_{j=1}^{N_\text{samples}} \{ \text{Prob}(Y\leq k|\lambda^{(j)}_i)-I(y_i\leq k) \} . \]

  3. Compute \[ S_\text{CRPS}(F_i,y_i) = \sum_{k=0}^{K} \hat{r}_{ik}^2 \]

Example: CRPS for Poisson data

Implementation:

# some large value, so that 1-F(K) is small
max_K <- ceiling(max(df_pois$y) + 4 * sqrt(max(df_pois$y)))
k <- seq(0, max_K)
kk <- rep(k, times = length(df_pois$y))
i <- seq_along(df_pois$y)
pred_pois <- generate(fit_pois, df_pois,
  formula = ~ {
    lambda <- exp(Intercept + cov) # "cov" is the component name defined in cmp
    ppois(kk, lambda = rep(lambda, each = length(k)))
  },
  n.samples = 2000
)
results <- data.frame(
  i = rep(i, each = length(k)),
  k = kk,
  Fpred = rowMeans(pred_pois),
  residuals =
    rowMeans(pred_pois) - (rep(df_pois$y, each = length(k)) <= kk)
)

library(dplyr) # for %>%, group_by(), summarise()
crps_scores <-
  (results %>%
    group_by(i) %>%
    summarise(crps = sum(residuals^2), .groups = "drop") %>%
    pull(crps))
summary(crps_scores)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4257  3.3576  7.0492 13.0760 20.5919 61.6573 

…but what about residuals?


Residuals: Frequentist vs Bayesian

🎯 Frequentist View

  • Model parameters are fixed but unknown.
  • Fitted values (predictions): \(\hat{y}_i\) are point estimates
  • Residuals:
    \[ r_i = y_i - \hat{y}_i \]
  • Single number per data point.
  • Used for:
    • Checking model fit / outliers

🔮 Bayesian View

  • Parameters are random variables with posterior \(\pi(\theta \mid y)\).
  • Predictions \(\tilde{y}_i\) also have a posterior distribution.
  • No single “true” fitted value → residuals are not uniquely defined.
    • \(r_i^{(\text{mean})} = (y_i - E[\tilde{y}_i \mid y])\) (mean residual)
    • \(r_i^{(\text{sample})} = (y_i - \tilde{y}^s)\) for posterior sample \(s\)
    • Distribution of \(y_i - \tilde{y}_i^{(s)}\) (posterior residuals)
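A sketch of these residual flavours, assuming a Gaussian fit `fit` with components `Intercept + cov` and data `df` (all names hypothetical):

```r
# posterior samples of the fitted values (here: the linear predictor)
eta_samples <- generate(fit, df, formula = ~ Intercept + cov, n.samples = 1000)

r_mean   <- df$y - rowMeans(eta_samples)  # one mean residual per data point
r_sample <- df$y - eta_samples            # matrix: one residual per posterior sample
```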

A better option is to use posterior predictive checks

One example of posterior predictive checks

\[ y_i|\eta_i\sim\mathcal{N}(\eta_i, \sigma^2) \]

  • Model 1 \[ \eta_i = \beta_0 + \beta_1 x_i \]
  • Model 2 \[ \eta_i = \beta_0 + f(x_i) \]

One example of posterior predictive checks

  1. Sample \(y^{*k}_i\sim\pi(y_i|\mathbf{y})\), \(k = 1,\dots,M\), using generate()

     Samples
             1      2      3      4     ...   M
     $y_1$   4.79   5.03   4.37   5.18  ...   5.07
     $y_3$   2.82   2.64   2.93   3.39  ...   3.81
     ...     ...    ...    ...    ...   ...   ...
     $y_N$   4.45   3.88   4.02   4.43  ...   4.8

  2. Compare some summaries of the simulated data with those of the observed data
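This comparison can be sketched as follows, assuming a matrix `y_star` of posterior predictive replicates (rows = observations, columns = samples, as in the table above) and the observed data in `df$y` (names hypothetical):

```r
# overlay the density of each simulated replicate on the data density
plot(density(df$y), lwd = 2, main = "Posterior predictive check")
for (k in 1:50) lines(density(y_star[, k]), col = "grey")
lines(density(df$y), lwd = 2)  # redraw the data density on top
```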

One example of posterior predictive checks

Here we compare the estimated posterior densities \(\hat{\pi}^k(y|\mathbf{y})\) with the estimated data density

This is just a simple example, but more complex checks can be computed with the same idea!

Leave-Group-Out Cross-Validation

This is a new option for cross-validation in inlabru . . .

  • Leave-one-out cross-validation (LOOCV) is a very common technique to evaluate predictions from models
  • When data are correlated (as in spatial statistics) LOOCV might be too optimistic and overestimate model performance.
  • One possible solution is then to remove “chunks” of data (for example one station and all its nearest neighbours)
  • This is the solution implemented in the inla.group.cv() function
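A minimal sketch, following the argument names in the INLA documentation for `inla.group.cv()`; the grouping setting here is illustrative:

```r
library(INLA)
# automatic groups built from the posterior correlation structure;
# num.level.sets controls how aggressively neighbours are grouped
gcv <- inla.group.cv(result = fit, num.level.sets = 3)
# gcv contains the leave-group-out predictive densities per observation
# (see the INLA documentation for the full structure of the returned object)
```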

Model validation for LGCP

Model validation for LGCP

Point processes (and so LGCPs) are different from all other spatial models:

  • The data are presence and absence of points
  • The likelihood depends on the data locations and on the integrated intensity over the whole domain

This makes model evaluation especially challenging. In particular, cross-validation-based measures are hard to define… why?

Cross validation for point processes

  • In a point process, empty areas are also “data”
  • We cannot just remove points as this will change the underlying intensity
  • We need to remove a whole subdomain in order to cross-validate

A warning note!

Warning⚠️

Do not use WAIC and DIC as computed today by inlabru to compare LGCP models

  • WAIC is linked to leave-one-out cross-validation and is therefore ill-defined for point processes
  • DIC is ok “in theory” but not the way it is computed today

Work on this is ongoing, and measures of fit for LGCPs will soon be available in inlabru

Some ideas on what one can do

  • It is possible to define residuals for point processes, for example as: \[ \hat{R}_B = n(B)- \int_B\hat{\lambda}(s)\ ds \] where
    • \(n(B)\) is the number of observed points in \(B\)
    • \(\hat{\lambda}(s)\) is the estimated intensity

These residuals can be used to evaluate the model.
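A hypothetical sketch of computing \(\hat{R}_B\), assuming an LGCP fit `fit_lgcp` with components `Intercept + field`, integration points `ips` (with weights `ips$weight`) covering the region \(B\), and the observed point count `n_B` (all names are illustrative assumptions):

```r
# posterior mean intensity at the integration points
pred <- predict(fit_lgcp, ips, ~ exp(Intercept + field))
# approximate the integral of the intensity over B by the weighted sum
int_lambda_B <- sum(pred$mean * ips$weight)
# residual: observed count minus expected count in B
R_B <- n_B - int_lambda_B
```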

This is still work in progress and at the moment not easily available… but it will be 😀

Here you can see some examples of computation and use.

Conclusion

Take home messages

  • Model check and model comparison are complex topics
  • There are no universal solutions; it all depends on which model characteristics you are interested in.
  • inlabru provides some easy-to-compute alternatives
  • LGCPs require their own tools for model validation
    • work is ongoing here… stay tuned 😀